A geopositioned and evidence-graded pan-species compendium of Mayaro virus occurrence

Mayaro virus (MAYV) is an emerging health threat in the Americas that can cause febrile illness as well as debilitating arthralgia or arthritis. To better understand the geographic distribution of MAYV risk, we developed a georeferenced database of MAYV occurrence based on peer-reviewed literature and unpublished reports. Here we present this compendium, which includes both point and polygon locations linked to occurrence data documented from the discovery of the virus in 1954 until 2022. We describe all methods used to develop the database, including data collection, georeferencing, management, and quality control. We also describe a customized grading system used to assess the quality of each study included in our review. The result is a comprehensive, evidence-graded database of confirmed MAYV occurrence in humans, non-human animals, and arthropods to date, containing 262 geo-positioned occurrences in total. This database, which can be updated over time, may be useful for local spillover risk assessment, epidemiological modelling to understand key transmission dynamics and drivers of MAYV spread, and identification of major surveillance gaps.

Two reviewers independently graded the evidence quality of each study, and the results were compared to reconcile any differences between the two reviewers. A third-party reviewer adjudicated if the two initial reviewers did not reach consensus. The quality score assigned to each article is included in the Quality_scores_MAYV_compendium.docx document in the Dryad data repository 21 .

Geo-positioning of the MAYV occurrence data
All available location information associated with confirmed MAYV occurrences was extracted from the included articles and added to the database using previously described methods 12-15 . We designated each MAYV occurrence as either a point or polygon location based on the spatial resolution provided in the article and estimated the kilometers (km) of uncertainty associated with each georeferenced occurrence 22 . For polygons, uncertainty was calculated as the distance from the polygon centroid coordinates to the polygon's furthest boundary. For point locations with well-defined boundaries, the same procedure was followed, whereby the uncertainty encompassed the extent of the location's area. When locations did not have well-defined boundaries, uncertainty was calculated as half the distance to the nearest named place 22,23 . Calculation of uncertainty was completed using measurements from Google Maps. When authors provided exact coordinates gathered using GPS, uncertainty was calculated using a georeferencing calculator (http://georeferencing.org/georefcalculator/gc.html). Exact coordinates were only used if authors provided a high level of precision (e.g., precision higher than "minutes" in degrees-minutes-seconds format and similarly high precision for coordinates in decimal degrees). When coordinates were provided at a low precision, we georeferenced the named place instead.
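The two uncertainty rules described above can be sketched in a few lines of Python. This is only an illustrative approximation: the compendium's measurements were taken in Google Maps, whereas this sketch swaps in a haversine great-circle distance, approximates the "furthest boundary" by the furthest boundary vertex, and uses hypothetical function names and made-up coordinates.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius (assumption: spherical Earth)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def polygon_uncertainty_km(centroid, boundary):
    """Distance from the centroid to the furthest boundary vertex.

    `centroid` is a (lat, lon) pair; `boundary` is a list of (lat, lon) vertices.
    """
    return max(haversine_km(centroid[0], centroid[1], lat, lon) for lat, lon in boundary)


def unbounded_place_uncertainty_km(place, nearest_named_place):
    """Half the distance to the nearest named place, for locations without clear boundaries."""
    return haversine_km(place[0], place[1], nearest_named_place[0], nearest_named_place[1]) / 2


# Illustrative (not real) geometry: a one-degree square near the equator.
square = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, 0.0)]
square_uncertainty = polygon_uncertainty_km((0.5, 0.5), square)
```

A planar or geodesic library would be a reasonable substitute for the haversine step; the point of the sketch is only the two rules (centroid to furthest boundary, and half the distance to the nearest named place).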
When latitude and longitude coordinates were provided, we verified the coordinates using Google Maps (https://www.google.com/maps). The coordinates were then converted to decimal degrees and added to the database as a point location. If a location was explicitly mentioned in the article and the uncertainty of the location was less than 5 kilometers (e.g., a neighborhood or small town), it was entered into the database as a point location and its centroid coordinates were recorded. We used an online gazetteer (www.geonames.com) as well as Google Maps or ArcGIS (ESRI 2011. ArcGIS Pro: Release 2.6.0. Redlands, CA: Environmental Systems Research Institute), along with contextual information, to verify site locations.
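As a rough illustration of the conversion step, the following Python sketch converts degrees-minutes-seconds coordinates to decimal degrees and flags coordinates that were reported only to the minute. The function names are hypothetical, the example values are illustrative rather than taken from the compendium, and representing a missing seconds component as `None` is an assumption about how low-precision coordinates might be encoded.

```python
from typing import Optional


def dms_to_decimal(degrees: int, minutes: int, seconds: Optional[float], hemisphere: str) -> float:
    """Convert a degrees-minutes-seconds coordinate to signed decimal degrees.

    South and west hemispheres yield negative values. A missing seconds
    component (None) is treated as zero for the purpose of conversion.
    """
    if hemisphere not in ("N", "S", "E", "W"):
        raise ValueError(f"unknown hemisphere: {hemisphere!r}")
    secs = 0.0 if seconds is None else seconds
    if not (0 <= minutes < 60 and 0 <= secs < 60):
        raise ValueError("minutes and seconds must be in the range [0, 60)")
    value = abs(degrees) + minutes / 60 + secs / 3600
    return -value if hemisphere in ("S", "W") else value


def reported_to_seconds(seconds: Optional[float]) -> bool:
    """True when the source reported a seconds component, i.e. precision finer than minutes."""
    return seconds is not None


# Illustrative values only.
lat = dms_to_decimal(3, 6, 0.0, "S")
lon = dms_to_decimal(60, 1, 30.0, "W")
```

Under the rule described in the text, a coordinate for which `reported_to_seconds` returns `False` would not be used verbatim; the named place would be georeferenced instead.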
When studies reported MAYV occurrence at a state or county level, we georeferenced the appropriate first level (ADM1), second level (ADM2) or third level (ADM3) administrative divisions. We coded these administrative polygons according to the global administrative unit layers (GAUL) from the Food and Agriculture Organization 24 . If the uncertainty of a specific named location was greater than 5 km (e.g., a large city such as Manaus, Brazil), we assigned this occurrence to a custom polygon created in ArcGIS that encompassed the extent of the location. In the rare case that no specific intra-country location was provided, the record was assigned to its country of occurrence (ADM0). When place names were duplicated (i.e., the ADM1 and ADM2 units had the same name), we used the larger location. For example, if the MAYV case location was reported as "Cusco, Peru," with no additional information provided, the record was assigned to the Cusco ADM1 polygon. However, if the study specified that the case occurred in the "City of Cusco", the record was assigned to a custom polygon that encompassed the City of Cusco. The centroid coordinates of ADM1, ADM2, and ADM3 polygons, or custom polygons, were retrieved from the GeoNames gazetteer whenever possible. If centroid coordinates were not available in GeoNames, they were estimated using Google Maps. The coordinates for each georeference and the methods and source used to obtain the coordinates were documented in the compendium.
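The assignment rules in the preceding paragraphs amount to a small decision procedure, sketched below in Python. The class, the function, and the exact prefix strings are hypothetical. The text specifies "less than 5 km" for points and "greater than 5 km" for custom polygons, so treating exactly 5 km as a custom polygon is an assumption made here.

```python
from dataclasses import dataclass
from typing import Optional

POINT_THRESHOLD_KM = 5.0  # points have less than 5 km of uncertainty


@dataclass
class ReportedLocation:
    """How a study described a location (hypothetical structure)."""
    admin_level: Optional[int] = None        # 0-3 when only an administrative unit was reported
    uncertainty_km: Optional[float] = None   # set for a specific named place or coordinates


def georeference_type(loc: ReportedLocation) -> str:
    """Return an illustrative georeference type: ADM0-ADM3, CP, or P."""
    if loc.admin_level is not None:
        if loc.admin_level not in (0, 1, 2, 3):
            raise ValueError("admin_level must be 0, 1, 2, or 3")
        return f"ADM{loc.admin_level}"
    if loc.uncertainty_km is None:
        # No intra-country location was provided: fall back to the country.
        return "ADM0"
    # Assumption: exactly 5 km is treated as a custom polygon.
    return "P" if loc.uncertainty_km < POINT_THRESHOLD_KM else "CP"


small_town = georeference_type(ReportedLocation(uncertainty_km=2.0))
large_city = georeference_type(ReportedLocation(uncertainty_km=20.0))
state_level = georeference_type(ReportedLocation(admin_level=1))
```

The sketch does not cover the duplicated place name rule (for example "Cusco, Peru"), which depends on contextual reading of the article rather than on a numeric threshold.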
Several articles reported the diagnosis of MAYV in human blood samples at urban hospitals. If no relevant information was provided on the study participants (e.g., place of residence), we georeferenced the ADM2 unit in which the hospital was located. In addition, we included several records of tourists who were diagnosed with MAYV upon returning to their countries of origin. When studies explicitly mentioned the location of travel, we georeferenced this location conservatively in order to account for the large uncertainty associated with the place of infection. For example, if a traveler reported visiting several locations in the state of Amazonas, Brazil, we georeferenced the entire state.

Notes on the quality score: (1) When multiple methods were used with varying validity (e.g., HI and PCR), we assigned a score to the most valid assay that detected MAYV positivity. (2) When no MAYV-positive results were found, we assigned a score of N/A.

Data and Metadata Records
This database is available in the Dryad data repository 21 . Each of the 276 rows represents a unique occurrence of MAYV in a human, non-human animal, or arthropod. Location IDs for points and polygons were assigned to each unique location. The MAYV occurrence database contains the following fields, following best-practice nomenclature as previously documented in georeferenced compendiums of other pathogens 12-15 :
1) Location_ID: A unique identifier was assigned to each georeference. The prefix used in the location ID denotes the georeference type: ADM 0, 1, 2, or 3 for administrative units, CP for custom polygons, and P for point locations. Separate studies with duplicate georeferences were assigned the same Location ID, and duplicates were removed according to the methods described below in the Usage Notes. A shapefile containing the custom polygons (Custom_polys_MAYV_compendium.zip) is available in the data repository 21 .
2) Author_Year: The first author and publication year for each record.
3) Ref_Number: A reference identification number was documented when applicable. A PubMed ID number was recorded for all published studies. If this was not available, a DOI, GenBank locus, URL, ProMED identifier, etc., was captured.
4) Year_MAYV_Start: The earliest year that MAYV infection was detected within the publication was recorded if available. If studies only included a range of years and did not specify the precise year that MAYV was found, this range was documented. Note that this variable refers to the detection of infection and does not imply the onset of infection (particularly in the case of serological-based occurrence studies).
5) Year_MAYV_End: The latest year that MAYV infection was detected within the publication was recorded if available. If studies only included a range of years and did not specify the precise year that MAYV was found, this range was documented. When studies did not report any year, we followed the methods of Hill et al. 25 (i.e., we assumed that the case was detected three years before publication).
6) Host_Type: One of three host types was documented for each occurrence: human, non-human animal, or arthropod. If multiple host types were detected with MAYV in the same location, a separate row was included for each host type.
7) Location_Description: We documented relevant information related to the location of the occurrence record. This field included the decisions made during the georeferencing process to reach the final determination regarding the location of each record.
8) Adm0: The country where MAYV occurrence was detected.
9) Adm1: The first level administrative unit where MAYV occurrence was detected (if available).
10) Adm2: The second level administrative unit where MAYV occurrence was detected (if available).
11) GAUL_code: When a MAYV occurrence was georeferenced as an ADM1 or ADM2 administrative polygon, the GAUL code was included.
12) Finer_Res: If finer spatial resolution was documented (e.g., a town, city, or exact coordinates), this was recorded.
13) Location_Type: Each occurrence was documented as either a point or polygon location type, depending on the spatial resolution that was provided. Custom polygons are available as shapefiles (Custom_polys_MAYV_compendium.zip) in the data repository 21 . These can be opened in GIS software or using statistical packages that handle spatial data.
14) Admin_Level: The administrative level for each polygon location was recorded as either 0 (country level), 1 (first level administrative division), 2 (second level administrative division), or 3 (third level administrative division), depending on the spatial resolution that was provided. If the occurrence was georeferenced as a point location or custom polygon, −999 was recorded.
15) Y_Coord: The longitude coordinate was recorded in decimal degrees. The coordinates were taken verbatim from the article when available. Otherwise, the polygon centroid was recorded.
16) X_Coord: The latitude coordinate was recorded in decimal degrees. The coordinates were taken verbatim from the article when available. Otherwise, the polygon centroid was recorded.
17) Coord_Source: This field describes how the coordinates were determined. Possibilities include the following: a. Exact coordinates provided in the article. b. Polygon centroid coordinates retrieved from GeoNames. c. Location was determined based on the details provided in the article (e.g., a specific neighborhood was mentioned), and centroid coordinates were subsequently determined using Google Maps.
18) Uncertainty_km: The amount of uncertainty associated with the record, measured in km. 19

www.nature.com/scientificdata

In addition to the main comma-delimited database (Georefs_MAYV_compendium.csv), three additional files are included as part of the file set, which can be found online 21 . These files include: (i) a document containing the quality score and a citation for each of the references included in our review (Quality_scores_MAYV_compendium.docx), (ii) a list of duplicate georeferences that were excluded from the database (Duplicate_entries_MAYV_compendium.csv), and (iii) a shapefile of the custom polygons (Custom_polys_MAYV_compendium.zip). The studies that described only negative MAYV results (i.e., those that are not included in the georeferenced compendium) are indicated by an asterisk in the Quality_scores_MAYV_compendium.docx document.
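When reading the comma-delimited file, two conventions described above need handling: the −999 placeholder in Admin_Level, and the three-year rule for records with no reported year. The Python sketch below shows one way to do this with the standard library. The sample rows are invented for illustration and are not taken from the compendium, the Location_ID values are hypothetical in format, and only a subset of the documented columns is shown.

```python
import csv
import io
from collections import Counter

# Invented sample rows using a subset of the documented field names.
SAMPLE = """Location_ID,Host_Type,Location_Type,Admin_Level,Uncertainty_km
P1,human,point,-999,2.5
ADM1_7,arthropod,polygon,1,310
CP3,human,polygon,-999,18
"""


def parse_admin_level(raw):
    """Return the administrative level, or None for the -999 placeholder."""
    # Normalise a Unicode minus sign in case the file uses one.
    value = int(raw.strip().replace("\u2212", "-"))
    return None if value == -999 else value


def impute_detection_year(reported_year, publication_year):
    """Apply the three-years-before-publication rule when no year was reported."""
    return reported_year if reported_year is not None else publication_year - 3


def load_rows(handle):
    """Read occurrence rows, converting Admin_Level and Uncertainty_km to Python types."""
    rows = []
    for row in csv.DictReader(handle):
        row["Admin_Level"] = parse_admin_level(row["Admin_Level"])
        row["Uncertainty_km"] = float(row["Uncertainty_km"])
        rows.append(row)
    return rows


rows = load_rows(io.StringIO(SAMPLE))
host_counts = Counter(row["Host_Type"] for row in rows)
```

To read the real file, `io.StringIO(SAMPLE)` would be replaced with an open handle on Georefs_MAYV_compendium.csv; the year fields may hold ranges rather than single years, so they are deliberately left as text here.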

Technical Validation
All georeferencing was completed by one study author and validated by a second author. In the case of a disagreement or discrepancy between the two authors, a third author adjudicated. A location ID was assigned to each unique georeference in the dataset. To ensure that no duplicate georeferences were included in the final dataset, we manually checked each new record that was added to ensure that it was unique. We also used an R script to ensure that all location IDs were unique and that no duplicate coordinates were included in the final dataset. In the case of duplicate georeferences, we retained the record with the highest quality score; if quality scores were identical, we retained the more recent record. The duplicate records are contained in the Duplicate_entries_MAYV_compendium.csv file in the data repository. In addition, we plotted all final georeferences in ArcGIS for visual inspection and checked all data points to ensure they fell on land and within the correct country. As an additional quality check, we used an R script to confirm our visual inspection. We used a land cover raster dataset that classifies each 5 × 5 km pixel according to the majority land cover class within the pixel. Coordinates that fell within any pixel classified as "water" were visually inspected and, if necessary, moved to the nearest land pixel.
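The duplicate-resolution rule can be sketched as follows. The authors used an R script, which is not reproduced here; this Python version is an independent illustration. It assumes that a numerically higher quality score means stronger evidence, that "more recent" refers to publication year, and that duplicates are defined per location and host type (as the Usage Notes describe). The `quality_score` and `pub_year` keys are hypothetical names.

```python
def resolve_duplicates(records):
    """Keep one record per (Location_ID, Host_Type).

    The retained record has the highest quality score; ties are broken by the
    more recent publication year. Returns (kept, dropped) lists.
    """
    best = {}
    dropped = []
    for rec in records:
        key = (rec["Location_ID"], rec["Host_Type"])
        current = best.get(key)
        if current is None:
            best[key] = rec
            continue
        rank_new = (rec["quality_score"], rec["pub_year"])
        rank_old = (current["quality_score"], current["pub_year"])
        if rank_new > rank_old:
            dropped.append(current)
            best[key] = rec
        else:
            dropped.append(rec)
    return list(best.values()), dropped


# Invented records for illustration only.
records = [
    {"Location_ID": "ADM2_1", "Host_Type": "human", "quality_score": 3, "pub_year": 1981},
    {"Location_ID": "ADM2_1", "Host_Type": "human", "quality_score": 3, "pub_year": 2004},
    {"Location_ID": "ADM2_1", "Host_Type": "arthropod", "quality_score": 2, "pub_year": 1981},
    {"Location_ID": "P9", "Host_Type": "human", "quality_score": 1, "pub_year": 2015},
    {"Location_ID": "P9", "Host_Type": "human", "quality_score": 4, "pub_year": 1999},
]
kept, dropped = resolve_duplicates(records)
```

Keying on both location and host type keeps the human and arthropod records at the same location as separate rows, which matches the way the compendium reports multiple host types at one georeference.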

Usage Notes
We identified 145 eligible references for inclusion in our study (see flowchart in Fig. 1) 1,3-6,8,11, . The resulting database contains 262 unique geo-positioned MAYV locations worldwide, including 93 unique points and 169 unique polygons (see Fig. 2 for MAYV occurrences by year and region). Each of these locations is a unique place where MAYV was detected in humans, non-human animals, or arthropods. Duplicate georeferences from the same host type were removed from the main database (see the Duplicate_entries_MAYV_compendium.csv file in the online data repository 21 ) following the approach specified in the methods section. Some duplicate georeferences were included if multiple host types (e.g., human and arthropod) were found with MAYV at the same location. For example, Hoch et al. 158 detected MAYV in both humans and arthropods in the ADM2 unit of Belterra, Brazil. Two separate rows (one row for humans and one for arthropods) are included in the compendium with the same georeference; therefore, the database includes 276 rows, each representing a unique occurrence of MAYV in a human, non-human animal, or arthropod. Of these 276 rows, 218 (79%) were in humans, 34 (12%) were in non-human animals, and 24 (9%) were in arthropods. MAYV was reported in 15 countries overall, with the majority of occurrences in Brazil (n = 134). According to our review, MAYV occurrences are limited to the region between latitudes 35° S and 12° N (see Figs. 3-4 for maps illustrating the geographic distribution of MAYV). One article 51 reported MAYV occurrence in Zambia; this occurrence was not included in our georeferenced database due to the lack of evidence supporting MAYV circulation outside of the Americas and the potential cross-reactivity with antibodies of other alphaviruses in the Semliki Forest serocomplex 159 .
Studies were included in our systematic review if they reported testing for MAYV occurrence, even if MAYV was not detected. These negative results are not included in the georeferenced compendium, but they can be found in the data repository 21 . Many machine-learning models that are common in the ecological literature are presence-only or presence-background algorithms that rely on "pseudoabsence" data in lieu of true absences. For this reason, the true absence data presented here are potentially valuable for disease modelling. However, any reports of disease absence must be considered carefully, as true absence is difficult to establish and false absence data can result in miscalibration of distribution models 160 . Ideally, representative country-wide surveys should be used to ascertain "true" absence locations that can be used in subsequent modelling efforts 161,162 . As with other published compendiums 12 , these curated data derived from published sources are expected to complement and augment other surveillance data used by public health agencies, thereby increasing our understanding of the distribution of MAYV in the Americas across multiple host types with a high spatial resolution. These data may also help identify under-sampled regions and priority regions for surveillance. The georeferences can also serve as the basis for development of epidemiological models or risk maps that characterize the potential suitability for MAYV occurrence, including the risk of spillover into human populations and the potential influence of climate change on MAYV distribution. For example, the 2013 compendium of dengue virus (DENV) occurrence 13 was used as the basis for a highly cited modelling study that estimated the global distribution of DENV risk 163 . Finally, leveraging the methods and data presented here, this open access database can be updated as additional studies are published that report MAYV in the Americas.
There are several important limitations that must be considered when using this dataset. One significant limitation is the impact of sampling bias on the detection and public reporting of MAYV occurrence. Heterogeneity of public health arboviral surveillance systems (including variability in surveillance infrastructure and competing public health demands) and MAYV research activity may skew MAYV detection and reporting by geographic region. Therefore, the absence of MAYV occurrence in some settings may not represent true disease absence, but rather ascertainment bias. This important limitation must be addressed in subsequent modelling studies in order to reduce the effects of sampling bias on model accuracy 164 . Some published studies have proposed an evidence consensus score, which quantifies the evidence supporting the presence or absence of a pathogen in a given region 165 . This score can be calculated using multiple evidence categories (e.g., health organization reporting status or health expenditure), which may provide useful evidence of disease presence or absence, including in areas with more limited arboviral surveillance.
Another limitation of our study is the lack of geographic precision associated with MAYV occurrence records. Many articles did not provide sufficient geographic detail to georeference MAYV records with a high level of precision. We attempted to capture this uncertainty by assigning polygon locations to these records. When a greater level of geographic detail was provided by study authors, we were able to georeference some records as point locations (i.e., locations of MAYV occurrence with less than 5 km of uncertainty).
Finally, an additional limitation is the variable validity of the assays used to detect MAYV. Some studies reported MAYV presence based only on positive serological assays such as hemagglutination inhibition (HI) tests, while other studies provided stronger evidence of MAYV occurrence based on reference neutralization assays or PCR testing. We estimated the strength of evidence of MAYV occurrence using a custom evidence grade, which could be used in other studies. The strength of evidence annotated in these data can be considered in future modelling efforts, with certain low-evidence records potentially excluded from models as part of sensitivity analyses. Moreover, the variability of evidence for MAYV occurrence demonstrated here prompts study design considerations for future MAYV research and public health surveillance.